Lecture 2
This lecture, taught by Prof. Cathy Yi-Hsuan Chen, introduces importing packages, reading and writing files, and using pandas to read and write structured data.
The accompanying code can be found on GitHub.
Outlines
Pandas
- Pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python.
- It provides two workhorse data structures: Series and DataFrame.
Series
- Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
import pandas as pd
Series1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
- You can also create a Series from a dictionary object; the keys become the index.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
Series3 = pd.Series(sdata)
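As a minimal sketch of how the index is used in practice, the following shows label-based lookup, filtering, and what happens when a Series is built from a dict. The names `s`, `by_state`, and `subset`, and the label 'California', are illustrative and not part of the lecture code.

```python
import pandas as pd

# A Series with an explicit string index
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Look up a single value by its label
value_a = s['a']

# Boolean filtering keeps the labels attached to the values
positive = s[s > 0]

# Building a Series from a dict uses the keys as the index
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
by_state = pd.Series(sdata)

# Passing an index selects and orders labels; labels missing from the dict become NaN
subset = pd.Series(sdata, index=['California', 'Ohio', 'Texas'])
```

Because 'California' is not a key in `sdata`, its entry in `subset` is a missing value, which is one common way NA values enter a data set.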
DataFrame
- DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
- DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
- There are numerous ways to construct a DataFrame
# Way 1: a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame1 = pd.DataFrame(data)
# you can order the columns
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
# Way 2: a nested dict of dicts format
data = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(data)
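To make the nested-dict case concrete, here is a small sketch of how pandas interprets it: outer keys become columns, inner keys become row labels, and combinations that are absent are filled with NaN. The variable names (`nested`, `frame`, `missing`) are illustrative.

```python
import pandas as pd

nested = {'Nevada': {2001: 2.4, 2002: 2.9},
          'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(nested)

columns = list(frame.columns)   # outer keys become the columns
years = sorted(frame.index)     # inner keys (union over all columns) become the rows

# Nevada has no entry for 2000, so that cell is NaN
missing = frame.loc[2000, 'Nevada']

# Transposing swaps rows and columns
transposed = frame.T
```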
- Generate a date series
import numpy as np
dates = pd.date_range('1/1/2020', periods=10)
df = pd.DataFrame(np.random.randn(10, 5), index=dates, columns=['A', 'B', 'C', 'D', 'E'])
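A date-indexed DataFrame can then be selected by date. The sketch below assumes daily data as above; the seeded generator `rng` and the names `first_week` and `single_day` are illustrative choices, not part of the lecture code.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', periods=10)   # daily frequency by default
rng = np.random.default_rng(0)                    # seeded so the run is repeatable
df = pd.DataFrame(rng.standard_normal((10, 5)), index=dates,
                  columns=['A', 'B', 'C', 'D', 'E'])

# Slicing with date strings includes both endpoints
first_week = df.loc['2020-01-01':'2020-01-07']

# A single timestamp selects one row, returned as a Series
single_day = df.loc[pd.Timestamp('2020-01-03')]
```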
- Indexing, selection, and filtering in DataFrame
import numpy as np  # NumPy is the fundamental package for scientific computing
data = pd.DataFrame(np.random.randn(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# try the following selections
data['two']
# use .iloc for positional access; plain [0] on a label-indexed Series is deprecated
data['two'].iloc[0]
data[['three', 'one']]
data[:2]
data[2:]
data[data['three'] > 0.1]
# try the following filtering
data[data < 0.1] = 0
# .ix was removed in pandas 1.0; use .loc (labels) and .iloc (positions) instead
data.loc['Colorado', ['two', 'three']]
data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]
data.loc[data.three > 0.1, data.columns[:3]]
# loc: Access a group of rows and columns by label(s) or a boolean array.
data.loc['Ohio']
# Single label for row and column
data.loc['Ohio','three']
# iloc: Purely integer-location based indexing for selection by position
data.iloc[0]
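The difference between loc and iloc is easier to see with fixed values. The following sketch uses `np.arange` instead of random numbers so the results are predictable; the threshold 5 and the result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: row 'Colorado', columns 'two' and 'three'
by_label = data.loc['Colorado', ['two', 'three']]

# Position-based: the same cells, addressed by integer position
by_position = data.iloc[1, [1, 2]]

# Label slices include the end label; position slices exclude the end position
label_slice = data.loc['Ohio':'Utah']     # 3 rows
position_slice = data.iloc[0:2]           # 2 rows

# Boolean row mask, followed by a positional column slice
mask_rows = data.loc[data['three'] > 5].iloc[:, :3]
```

Note the asymmetry in slicing: `data.loc['Ohio':'Utah']` returns three rows, while `data.iloc[0:2]` returns two.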
- You can drop rows containing NA values from a DataFrame: dropna()
- You can replace NA values with 0: fillna(0)
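A short sketch of both methods on a small, made-up frame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [4.0, 5.0, np.nan],
                      'c': [7.0, 8.0, 9.0]})

dropped_rows = frame.dropna()          # keep only rows with no missing value
dropped_cols = frame.dropna(axis=1)    # keep only columns with no missing value
filled = frame.fillna(0)               # replace every NaN with 0

# Both methods return a new object; the original frame is unchanged
still_missing = int(frame.isna().sum().sum())
```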
Data Input/Output
Reading and writing text files
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
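One way to do this with pandas is the `nrows` and `chunksize` arguments of `read_csv`. The sketch below uses a small in-memory CSV with made-up values in place of a large file on disk; the column names and the running-total calculation are only for illustration.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk (made-up values)
csv_text = "date,close,volume\n" + "\n".join(
    f"2013-01-{day:02d},{100 + day},{1000 * day}" for day in range(1, 11)
)

# Read only the first three data rows
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Iterate over the file in chunks of four rows and accumulate a running total
total_volume = 0
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total_volume += chunk['volume'].sum()
```

With `chunksize`, each `chunk` is an ordinary DataFrame, so only one piece of the file needs to be in memory at a time.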
Using AAPL stock data as an example: find the data here and save it to your working directory.
apple_stock = pd.read_csv('AAPL.csv', index_col='date', parse_dates=True)
# slicing
apple_stock_2013 = apple_stock.loc[apple_stock.index.year == 2013, ['low', 'high', 'open', 'close', 'volume']]
# sorting
# sort_values returns a new DataFrame; reassigning avoids modifying a slice in place
apple_stock_2013 = apple_stock_2013.sort_values(by='volume', ascending=False)
# Save the new data in JSON or CSV format
apple_stock_2013.to_json('AAPL_2013.json')
apple_stock_2013.to_csv('test.csv')
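The same read, slice, sort, and write steps can be tried without downloading anything. The sketch below uses a three-row in-memory CSV whose values are made-up placeholders (not real AAPL prices) in the column layout the code above assumes, and writes to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Made-up placeholder rows, not real market data
csv_text = """date,open,high,low,close,volume
2012-12-31,10.0,11.0,9.0,10.5,300
2013-01-02,10.5,12.0,10.0,11.5,500
2013-01-03,11.5,12.5,11.0,12.0,200
"""

stock = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)

# parse_dates=True turns the index into a DatetimeIndex, so .year is available
stock_2013 = stock.loc[stock.index.year == 2013, ['close', 'volume']]

# Sort by traded volume, largest first
stock_2013 = stock_2013.sort_values(by='volume', ascending=False)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
stock_2013.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col='date', parse_dates=True)
```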
OS module
The OS module in Python provides a way of using operating-system-dependent functionality.
Using a Shakespeare corpus as an example: find the text here and save it to your working directory.
import os
path_direct = os.getcwd()
# os.path.join builds the path with the correct separator for the operating system
os.chdir(os.path.join(path_direct, 'course'))
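A few other commonly used `os` functions can be sketched in a throwaway directory, so nothing in the real working directory is touched. The directory name 'course' mirrors the code above; the file name 'notes.txt' is hypothetical.

```python
import os
import tempfile

start_dir = os.getcwd()                       # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    course_dir = os.path.join(tmp, 'course')  # portable path building
    os.makedirs(course_dir, exist_ok=True)    # create the directory if needed

    os.chdir(course_dir)                      # change the working directory
    with open('notes.txt', 'w', encoding='utf-8') as handle:
        handle.write('hello')

    files = os.listdir('.')                   # list entries in the working directory
    exists = os.path.exists('notes.txt')

    os.chdir(start_dir)                       # go back before the temp dir is removed
```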
# Use the built-in function open() to open the file and close() to close it
shakespeare = open('shakespeare.txt', 'r', encoding='utf-8')
for string in shakespeare:
    print(string)
shakespeare.close()
- Way 1: read strings
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # read(n) returns the next n characters as a string
    shakespeare_string_10 = shakespeare_read.read(10)
    # read() with no argument returns everything from the current position to the end
    shakespeare_string = shakespeare_read.read()
- Way 2: read a single line
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # readline() reads one line per call
    print(shakespeare_read.readline(), end='*')
    print(shakespeare_read.readline(), end='*')
    print(shakespeare_read.readline(), end='*')
- Way 3: read multiple lines and create a list of strings
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # readlines() returns a list in which every line of the file is one string
    shakespeare_lines = shakespeare_read.readlines()
    print(shakespeare_lines)
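The three ways of reading can be compared side by side on a small file that the example writes itself, which also shows the writing half of file I/O. This is a sketch: the file name 'sample.txt' and its contents are made up, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

text = "line one\nline two\nline three\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')

    # Mode 'w' creates the file (or overwrites an existing one)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    with open(path, 'r', encoding='utf-8') as f:
        first_four = f.read(4)      # first 4 characters
        rest = f.read()             # everything after those 4 characters

    with open(path, 'r', encoding='utf-8') as f:
        first_line = f.readline()   # includes the trailing newline

    with open(path, 'r', encoding='utf-8') as f:
        lines = f.readlines()       # list of lines, newlines kept

    # Iterating over the file object reads one line at a time
    with open(path, 'r', encoding='utf-8') as f:
        stripped = [line.rstrip('\n') for line in f]
```

For large files, iterating over the file object is usually preferable to `readlines()`, since it does not load the whole file into memory at once.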